44 research outputs found

    Direct N-body code on low-power embedded ARM GPUs

    Full text link
    This work arises in the context of the ExaNeSt project, which aims to design and develop an exascale-ready supercomputer with a low energy consumption profile that can still support the most demanding scientific and technical applications. The ExaNeSt compute unit consists of densely packed low-power 64-bit ARM processors, embedded within Xilinx FPGA SoCs. SoC boards are heterogeneous architectures where computing power is supplied by both CPUs and GPUs, and they are emerging as a possible low-power and low-cost alternative to clusters based on traditional CPUs. A state-of-the-art direct N-body code suitable for astrophysical simulations has been re-engineered to exploit SoC heterogeneous platforms based on ARM CPUs and embedded GPUs. Performance tests show that embedded GPUs can be effectively used to accelerate real-life scientific calculations, and that they are also promising because of their energy efficiency, which is a crucial design factor in future exascale platforms. Comment: 16 pages, 7 figures, 1 table, accepted for publication in the Computing Conference 2019 proceedings

    Extending Promela and Spin for real time

    Full text link

    Reduced instruction set computer architectures for VLSI

    No full text

    Credit-Flow-Controlled ATM versus Wormhole Routing

    No full text
    ATM has been adopted as the main high-speed technology in both wide and local area networks. When ATM is combined with credits (the flow control mechanism that is particularly suitable for local data communication), it becomes appropriate for multiprocessor interconnection networks as well. Actually, credit-flow-controlled ATM has similarities with wormhole routing, one of the most popular architectures for MP networks: they both use credits and fixed-size cells/flits, and their hardware complexity is comparable. In this paper, we show that ATM with credits performs better than wormhole routing, because ATM uses lanes more efficiently: ATM provides high throughput and low latency with much less buffer space than that required by wormhole routing; also, ATM demonstrates little sensitivity to bursty traffic, and, unlike wormhole, it is fair in terms of latency in hot-spot configurations. Our simulation uses detailed and realistic switch models, which operate at clock-cycle granularity and track ..

    Reduced instruction set computers

    No full text

    VISA: A variable instruction set architecture

    No full text

    A systematic evaluation of emerging mesh-like CMP NoCs

    No full text
    This paper studies alternative Network-on-Chip architectures for emerging many-core chip multiprocessors, by exploring the following design options on mesh-based networks: multiple physical networks (P), core concentration (C), express channels (X), flit widths (W), and virtual channels (V). We exhaustively evaluate all combinations of the aforementioned parameters (P, C, X, W, V), using the energy-throughput ratio (ETR) as a metric to classify network configurations. Our experimental results show that, on one hand, with an appropriate selection of parameters (V, W), an optimized baseline 2D mesh offers the best possible ETR for NoCs with up to a few tens of cores (64-core NoC). More complicated networks, using concentration and express channels, can reduce the zero-load latency, but do not necessarily help to improve ETR. On the other hand, for larger CMPs, a 2D mesh with multiple physical networks is a better option: once optimized, this architectural choice can reduce the ETR by up to 46% for 256 cores.

    Receive-Side Notification for Enhanced RDMA in FPGA Based Networks

    No full text
    FPGAs are rapidly gaining traction in the domain of HPC thanks to the advent of FPGA-friendly data-flow workloads, as well as their flexibility and energy efficiency. However, these devices pose a new challenge in terms of how best to support their communications, since standard protocols are known to hinder their performance greatly, either by requiring CPU intervention or by consuming too much FPGA logic. Hence, the community is moving towards custom-made solutions. This paper analyses an optimization to our custom, reliable interconnect with connectionless transport: a mechanism to register and track inbound RDMA communication at the receive side. This way, it provides completion notifications directly to the remote node, which saves a round-trip latency. The entire mechanism is designed to sit within the fabric of the FPGA, requiring no software intervention. Our solution is able to reduce the latency of a receive operation by around 20% for small message sizes (4KB) over a single hop (longer distances would experience even higher improvement). Results from synthesis over a wide parameter range confirm that this optimization is scalable both in terms of the number of concurrent outstanding RDMA operations and the maximum message size.

    Synchronization support in I/O adapter based SCI clusters

    No full text